ETC5521: Diving Deeply into Data Exploration

Initial data analysis and model diagnostics

Professor Di Cook

Department of Econometrics and Business Statistics

The role of initial data analysis



The first thing to do with data is to look at them …. usually means tabulating and plotting the data in many different ways to see what’s going on. With the wide availability of computer packages and graphics nowadays there is no excuse for ducking the labour of this preliminary phase, and it may save some red faces later.

Crowder, M. J. & Hand, D. J. (1990) “Analysis of Repeated Measures”

Initial Data Analysis and Confirmatory Analysis

Prior to conducting a confirmatory data analysis, it is important to conduct an initial data analysis (IDA).

  • Confirmatory data analysis (CFA) is focused on statistical inference and includes procedures for:
    • hypothesis testing,
    • predictive modelling,
    • parameter estimation including uncertainty,
    • model selection.
  • IDA includes:
    • describing the data and collection procedures,
    • scrutinising the data for errors, outliers, and missing observations,
    • checking assumptions for confirmatory data analysis.

IDA is sometimes called preliminary data analysis.

IDA is related to exploratory data analysis (EDA) in the sense that it is primarily conducted graphically, and there are few formal tests available.

Taxonomies are useful but rarely perfect

Objectives of IDA?

The main objective for IDA is to intercept any problems in the data that might adversely affect the confirmatory data analysis.

  • The role of CFA is to answer the intended question(s) that the data were collected for.
  • IDA is often unreported in data analysis reports or scientific papers, for various reasons: it might not have been done, or it may have been conducted but there was no space in the paper to report on it.

IDA in government statistics

The purpose of data cleaning is to bring data up to a level of quality such that it can reliably be used for the production of statistical models or statements.

A statistical value chain is constructed by defining a number of meaningful intermediate data products, for which a chosen set of quality attributes are well described.

van der Loo & de Jonge (2018) Statistical Data Cleaning with Applications in R

IDA in health and medical data

Huebner et al (2018)’s six steps of IDA: (1) Metadata setup, (2) Data cleaning, (3) Data screening, (4) Initial reporting, (5) Refining and updating the analysis plan, (6) Reporting IDA in documentation.

Heed these words

IDA prepares an analyst for CFA. One needs to be careful NOT to compromise the inference.

How do you compromise inference?

  1. Changing your inference or questions based on what you find in IDA.
  2. Removing (or keeping) outliers without justification.
  3. Missing value imputation choices.
  4. Treatment of zeros.
  5. Mishandling of variable types, e.g. categorical or temporal variables read as numeric.
  6. Lack of multivariate relationship checking, including subsets based on levels of categorical variables.
  7. Choosing variables and observations.

How do you avoid these errors?

  • Document ALL the IDA, using a reproducible analysis script.
  • Pre-register your CFA plan, so that your CFA questions do not change.
  • Decisions made on outlier removal, variable selection, recoding, sampling, and handling of zeros have known effects on results, and are justifiable.

Insure yourself against accusations of data snooping, data dredging, data fishing.

Data screening


It’s important to check how the data are understood by the computer, that is, to check the data types:

  • Was the date read in as character?
  • Was a factor read in as numeric?

Also important for making inference is to know whether the data supports making broader conclusions. How was the data collected? Is it clear what the population of interest is, and that the data is a representative sample of this population?

Example: Checking the data type (1/2)

lecture3-example.xlsx

library(readxl)
library(here)
df <- read_excel(here("data/lecture3-example.xlsx"))
df
# A tibble: 5 × 4
     id date                loc       temp
  <dbl> <dttm>              <chr>    <dbl>
1     1 2010-01-03 00:00:00 New York  42  
2     2 2010-02-03 00:00:00 New York  41.4
3     3 2010-03-03 00:00:00 New York  38.5
4     4 2010-04-03 00:00:00 New York  41.1
5     5 2010-05-03 00:00:00 New York  39.8
  • What problems are there with the computer’s interpretation of data type?
  • What context specific issues indicate incorrect computer interpretation?

Example: Checking the data type (2/2)

library(lubridate)
df <- read_excel(here("data/lecture3-example.xlsx"), 
                 col_types = c("text", 
                               "date", 
                               "text",
                               "numeric"))

df |> 
  mutate(id = as.factor(id),
         date = ydm(date)) |>
  mutate(
         day = day(date),
         month = month(date),
         year = year(date)) 
# A tibble: 5 × 7
  id    date       loc       temp   day month  year
  <fct> <date>     <chr>    <dbl> <int> <dbl> <dbl>
1 1     2010-03-01 New York  42       1     3  2010
2 2     2010-03-02 New York  41.4     2     3  2010
3 3     2010-03-03 New York  38.5     3     3  2010
4 4     2010-03-04 New York  41.1     4     3  2010
5 5     2010-03-05 New York  39.8     5     3  2010
  • id is now a factor instead of integer
  • day, month and year are now extracted from the date
  • Is it okay now?
  • In the United States, it’s common to use the date format MM/DD/YYYY (gasps) while the rest of the world commonly uses DD/MM/YYYY or better still YYYY/MM/DD.
  • It’s highly probable that the dates are 1st-5th March and not 3rd of Jan-May.
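The ambiguity can be demonstrated in base R: the same string yields different dates under different assumed field orders (a minimal sketch, not from the original example file):

```r
# Same string, different assumed orders of day and month:
x <- "2010-05-03"

as.Date(x, format = "%Y-%m-%d")  # year-month-day: 3rd of May
as.Date(x, format = "%Y-%d-%m")  # year-day-month: 5th of March
```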

Example: Specifying the data type with R

  • You can robustify your workflow by ensuring you have a check for the expected data type in your code.
xlsx_df <- read_excel(here("data/lecture3-example.xlsx"),
                 col_types = c("text", "date", "text", "numeric")) %>% 
  mutate(id = as.factor(id), 
         date = as.character(date),
         date = as.Date(date, format = "%Y-%d-%m"))
  • read_csv has broader support for col_types
csv_df <- read_csv(here::here("data/lecture3-example.csv"),
                 col_types = cols(
                      id = col_factor(),
                      date = col_date(format = "%m/%d/%y"),
                      loc = col_character(),
                      temp = col_double()))
  • The checks (or coercions) ensure that even if the data are updated, you can have some confidence that any data type error will be picked up before further analysis.
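One way to implement such a check is with explicit assertions; a base-R sketch using stopifnot() (the column names follow the example above, and the specific assertions are assumptions about what the downstream analysis requires):

```r
# Defensive type assertions: stop early if an updated file changes types.
df <- data.frame(
  id   = factor(1:3),
  date = as.Date(c("2010-03-01", "2010-03-02", "2010-03-03")),
  temp = c(42, 41.4, 38.5)
)

stopifnot(
  is.factor(df$id),            # id should be a factor, not a number
  inherits(df$date, "Date"),   # date should be a Date, not character
  is.numeric(df$temp)          # temp should be numeric
)
```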

Example: Checking the data type with R

You can have a quick glimpse of the data type with:

dplyr::glimpse(xlsx_df)
Rows: 5
Columns: 4
$ id   <fct> 1, 2, 3, 4, 5
$ date <date> 2010-03-01, 2010-03-02, 2010-03-03, 2010-03-0…
$ loc  <chr> "New York", "New York", "New York", "New Yor…
$ temp <dbl> 42, 41, 38, 41, 40
dplyr::glimpse(csv_df)
Rows: 5
Columns: 4
$ id   <fct> 1, 2, 3, 4, 5
$ date <date> 2010-03-01, 2010-03-02, 2010-03-03, 2010-03-0…
$ loc  <chr> "New York", "New York", "New York", "New Yor…
$ temp <dbl> 42, 41, 38, 41, 40

Example: Checking the data type visually

You can also visualise the data type with:

library(visdat)
vis_dat(xlsx_df)

library(inspectdf)
inspect_types(xlsx_df) %>% 
  show_plot()

Data cleaning

Data cleaning (1/2)

Data quality checks should be one of the first steps in the data analysis to assess any problems with the data.

These include using common or domain knowledge to check if the recorded data have sensible values.

  • Are values that should be positive, e.g. height and weight, recorded as positive values within a plausible range?
  • If the data are counts, do the recorded values contain non-integer values?
  • For compositional data, do the values add up to 100% (or 1)? If not, is that a measurement error or due to rounding? Or is another variable missing?
  • Does the data contain only positives, e.g. disease occurrences or warranty claims? If so, what would the no-report group look like?
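The checks above can be encoded directly; a base-R sketch on made-up values (the thresholds are illustrative assumptions, to be replaced with domain knowledge):

```r
height <- c(1.70, 1.80, 1.65)   # metres; should be positive and plausible
counts <- c(3, 0, 7)            # should be non-negative integers
shares <- c(0.2, 0.5, 0.3)      # compositional; should sum to 1

all(height > 0 & height < 2.5)              # plausible adult heights
all(counts >= 0 & counts == floor(counts))  # integer-valued counts
isTRUE(all.equal(sum(shares), 1))           # sums to 1, up to rounding
```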

Data cleaning (2/2)

In addition, numerical or graphical summaries may reveal that there is unwanted structure in the data, for example,

  • Does the treatment group have different demographic characteristics to the control group?
  • Are the distributions similar between the training and test sets?
  • Does the distribution of the data imply violations of assumptions for the CFA, such as
    • non-normality,
    • discrete rather than real-valued, or
    • different variance in different domains?

Data scrutinising is a process that you get better at with practice and with familiarity with the domain area.

Example: Checking the data quality

# A tibble: 9 × 4
  id    date       loc        temp
  <fct> <date>     <chr>     <dbl>
1 1     2010-03-01 New York   42  
2 2     2010-03-02 New York   41.4
3 3     2010-03-03 New York   38.5
4 4     2010-03-04 New York   41.1
5 5     2010-03-05 New York   39.8
6 6     2020-03-01 Melbourne  30.6
7 7     2020-03-02 Melbourne  17.9
8 8     2020-03-03 Melbourne  18.6
9 9     2020-03-04 <NA>       21.3
  • Numerical or graphical summaries, or even just eye-balling the data, help to uncover some data quality issues.
  • Any issues here?



  • There’s a missing value in loc.
  • Temperature is in Fahrenheit for New York but Celsius in Melbourne (you can validate this using external sources).
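Once the units are identified, the fix is a simple conversion; a base-R sketch using the New York values from the example (assuming they are indeed Fahrenheit):

```r
# Convert Fahrenheit to Celsius so the two locations are comparable
f_to_c <- function(f) (f - 32) * 5 / 9

ny_f <- c(42, 41.4, 38.5, 41.1, 39.8)
round(f_to_c(ny_f), 1)  # 5.6 5.2 3.6 5.1 4.3, plausible for early March
```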

Case study: World development indicators Part 1/3

raw_dat <- read_csv(here::here("data/world-development-indicators.csv"), na = "..",
  n_max = 11935)

country_code_df <- raw_dat %>%
  distinct(`Country Name`, `Country Code`) %>%
  rename_all(janitor::make_clean_names) %>%
  left_join(
    countrycode::codelist %>% select(iso3c, region, continent),
    by = c("country_code" = "iso3c")
  ) %>%
  arrange(continent, region) %>%
  mutate(country_code = fct_infreq(country_code))
# series_code_df <- raw_dat %>%
#   distinct(`Series Name`, `Series Code`) %>%
#   rename_all(janitor::make_clean_names) %>%
#   mutate(series_code = janitor::make_clean_names)
wdi <- raw_dat %>%
  gather(key = "Year", value = "value", `1969 [YR1969]`:last_col()) %>%
  rename_all(janitor::make_clean_names) %>%
  separate(year, into = c("year", "year_string"), sep = " ") %>%
  mutate(year = as.integer(year)) %>%
  select(-contains("name"), -year_string) %>%
  spread(key = series_code, value = value) %>%
  mutate(
    country_code = factor(country_code, levels = country_code_df$country_code)
  ) %>%
  rename_all(janitor::make_clean_names)

wdi2017 <- wdi %>% filter(year == 2017)

library(visdat)
vis_dat(wdi2017)



  • What are the data types?
  • How are missings distributed?
  • Which variables have insufficient values to analyse further?

World Development Indicators (WDI), sourced from the World Bank Group (2019)

Case study: World development indicators Part 2/3

library(naniar)
ggplot(wdi2017, 
       aes(x=en_pop_dnst, y=sp_urb_grow)) + 
  geom_miss_point(size=4, alpha=0.6) + 
  scale_color_discrete_qualitative() +
  theme(aspect.ratio=1)



en_pop_dnst = Population density (people per sq. km of land area)

sp_urb_grow = Urban population growth (annual %)



  • How are missings distributed?
  • Is there a relationship between population density and urban growth?

Is there a better way to plot this to see the relationship?

Case study: World development indicators Part 3/3

library(naniar)
ggplot(wdi2017, 
       aes(x=en_pop_dnst, y=sp_urb_grow)) + 
  geom_miss_point(size=4, alpha=0.6) + 
  scale_color_discrete_qualitative() +
  scale_x_log10() +
  theme(aspect.ratio=1)



en_pop_dnst = Population density (people per sq. km of land area)

sp_urb_grow = Urban population growth (annual %)



  • Is there a relationship between population density and urban growth?

Case study: Employment Data in Australia Part 1/3

Below is data from the ABS showing the total number of people employed in a given month from February 1978 to December 2019, using the original time series.


library(readabs)
# lfs_1 <- read_abs("6202.0", tables = 1)  %>% 
#  separate_series()
employed <- read_abs(series_id = "A84423085A") %>% 
  mutate(month = lubridate::month(date),
         year = factor(lubridate::year(date))) %>% 
  filter(year != "2020") %>% 
  select(date, month, year, value) 
# Employed total ;  Persons ; original
# read_abs(series_id = "A84423043C") # seasonally adjusted
# A84423127L # trend
#glimpse(employed)
glimpse(employed)
Rows: 545
Columns: 4
$ date  <date> 1978-02-01, 1978-03-01, 1978-04-01, 1978-05…
$ month <dbl> 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3,…
$ year  <fct> 1978, 1978, 1978, 1978, 1978, 1978, 1978, 19…
$ value <dbl> 5986, 6041, 6054, 6038, 6031, 6036, 6005, 60…

Australian Bureau of Statistics, 2020, Labour force, Australia, Table 01. Labour force status by Sex, Australia - Trend, Seasonally adjusted and Original, viewed 2024-08-02

Case study: Employment Data in Australia Part 2/3

Do you notice anything?

employed  %>% 
  ggplot(aes(month, value, color = year)) + 
  geom_line() + 
  ylab("employed ('000)") +
  scale_x_continuous("month", breaks=seq(1, 12, 1)) +
  ggtitle("1978 Feb - 2019 Dec") + 
  scale_color_discrete_qualitative() 

Why do you think the number of people employed is going up each year?


  • Australian population is 25.39 million in 2019
  • 1.5% annual increase in population
  • Vic population is 6.681 million (Sep 2020) - 26%
  • NSW population is 8.166 (Sep 2020) - 32%

Case study: Employment Data in Australia Part 3/3


df3 <- employed  %>% 
  dplyr::filter(as.numeric(as.character(year)) > 2008) %>% 
  mutate(month = factor(month)) 
ggplot(df3, aes(month, value, group = year)) + 
  geom_line() + 
  ylab("employed ('000)") +
  geom_text(data = filter(df3, month==12) %>% 
              mutate(month = ifelse(year %in% c(2011, 2013), 14, month)), 
            aes(label = year), hjust = -0.4, size = 5) +
  ggtitle("2009 Jan - 2019 Dec") +
  coord_cartesian(clip = 'off') + 
  theme(plot.margin = unit(c(1,4,1,1), "lines")) 


  • There’s a suspicious change in August numbers from 2014.
df3 %>% 
  filter(month %in% 8:9) %>% 
  pivot_wider(year, names_from = month) %>% 
  mutate(diff = `9` - `8`) %>% 
  ggplot(aes(year, diff)) + 
  geom_point() + 
  geom_line(group = 1) +
  guides(x = guide_axis(n.dodge = 2)) + 
  labs(y = "Difference (Sep - Aug)")

  • A potential explanation for this is that there was a change in the survey from 2014.

Also see https://robjhyndman.com/hyndsight/abs-seasonal-adjustment-2/

Data collection is appropriate

Example: Experimental layout and data Part 1/2

# to generate data that looks like it's non-randomised
rothamsted.brussels %>% 
  arrange(trt) %>% 
  mutate(row = rep(1:6, each = 8),
         col = rep(1:8, times = 6)) %>% 
  write_csv("data/lecture3-example3.csv")

lecture3-example3.csv

df3 <- read_csv(here::here("data/lecture3-example3.csv"),
                col_types = cols(
                  row = col_factor(),
                  col = col_factor(),
                  yield = col_double(),
                  trt = col_factor(),
                  block = col_factor()))
skimr::skim(df3)
── Data Summary ────────────────────────
                           Values
Name                       df3   
Number of rows             48    
Number of columns          5     
_______________________          
Column type frequency:           
  factor                   4     
  numeric                  1     
________________________         
Group variables            None  

── Variable type: factor ───────────────────────────────────
  skim_variable n_missing complete_rate ordered n_unique
1 row                   0             1 FALSE          6
2 col                   0             1 FALSE          8
3 trt                   0             1 FALSE          9
4 block                 0             1 FALSE          4
  top_counts                     
1 1: 8, 2: 8, 3: 8, 4: 8         
2 1: 6, 2: 6, 3: 6, 4: 6         
3 non: 16, hi : 4, hi : 4, hi : 4
4 B3: 12, B1: 12, B2: 12, B4: 12 

── Variable type: numeric ──────────────────────────────────
  skim_variable n_missing complete_rate mean   sd  p0 p25
1 yield                 0             1 246. 16.0 204 237
  p50  p75 p100 hist 
1 248 257.  273 ▂▂▇▇▅

Example: Experimental layout and data Part 2/2


ggplot(df3, aes(col, row, fill = trt)) + 
  geom_tile(color = "black", size = 1) + 
  coord_equal() +
  scale_fill_discrete_qualitative()

ggplot(df3, aes(col, row, fill = yield)) + 
  geom_tile(color = "black", size = 1) + 
  coord_equal() +
  scale_fill_continuous_sequential(palette = "Greens 3")


  • The experiment tests the effects of 9 fertilizer treatments on the yield of Brussels sprouts, on a field laid out in a rectangular array of 6 rows and 8 columns.
df3 %>% 
  mutate(trt = fct_reorder(trt, yield)) %>% 
  ggplot(aes(trt, yield)) + 
  geom_point(size = 4, alpha = 1 / 2) + 
  guides(x = guide_axis(n.dodge = 2))

  • High sulphur and high manure seem to be the best for the yield of Brussels sprouts.
  • Any issues here?


Summary of checks for design experiments

  • Check if the experimental layout given in the data and the description match.
  • In particular, check with a plot whether treatments are randomised.

Imputing missing values

Example 1: Olympic medals

What’s missing

Example 2: TAO

Validators

Case study: Dutch supermarket revenue and cost Part 1/3

  • Data contains the revenue and cost (in Euros) for 60 supermarkets
  • Data has been anonymised and distorted
data("SBS2000", package = "validate")
dplyr::glimpse(SBS2000)
Rows: 60
Columns: 11
$ id          <fct> RET01, RET02, RET03, RET04, RET05, RET…
$ size        <fct> sc0, sc3, sc3, sc3, sc3, sc0, sc3, sc1…
$ incl.prob   <dbl> 0.02, 0.14, 0.14, 0.14, 0.14, 0.02, 0.…
$ staff       <int> 75, 9, NA, NA, NA, 1, 5, 3, 6, 5, 5, 5…
$ turnover    <int> NA, 1607, 6886, 3861, NA, 25, NA, 404,…
$ other.rev   <int> NA, NA, -33, 13, 37, NA, NA, 13, NA, N…
$ total.rev   <int> 1130, 1607, 6919, 3874, 5602, 25, 1335…
$ staff.costs <int> NA, 131, 324, 290, 314, NA, 135, NA, 1…
$ total.costs <int> 18915, 1544, 6493, 3600, 5530, 22, 136…
$ profit      <int> 20045, 63, 426, 274, 72, 3, 1, 75, 110…
$ vat         <int> NA, NA, NA, NA, NA, NA, 1346, NA, NA, …

Case study: Dutch supermarket revenue and cost Part 2/3

  • Checking for completeness of records
library(validate)
rules <- validator(
          is_complete(id),
          is_complete(id, turnover),
          is_complete(id, turnover, profit))
out <- confront(SBS2000, rules)
summary(out)
  name items passes fails nNA error warning
1   V1    60     60     0   0 FALSE   FALSE
2   V2    60     56     4   0 FALSE   FALSE
3   V3    60     52     8   0 FALSE   FALSE
                         expression
1                   is_complete(id)
2         is_complete(id, turnover)
3 is_complete(id, turnover, profit)

Case study: Dutch supermarket revenue and cost Part 3/3

  • Sanity check derived variables
library(validate)
rules <- validator(
    total.rev - profit == total.costs,
    turnover + other.rev == total.rev,
    profit <= 0.6 * total.rev
)
out <- confront(SBS2000, rules)
summary(out)
  name items passes fails nNA error warning
1   V1    60     39    14   7 FALSE   FALSE
2   V2    60     19     4  37 FALSE   FALSE
3   V3    60     49     6   5 FALSE   FALSE
                                      expression
1 abs(total.rev - profit - total.costs) <= 1e-08
2 abs(turnover + other.rev - total.rev) <= 1e-08
3              profit - 0.6 * total.rev <= 1e-08

Take away messages

  • Check your data:
    • by validating the variable types
    • with independent or external sources
    • by checking the data quality
  • Check if the data collection method has been consistent
  • Check if the experimental layout given in the data and the description match
  • Consider if or how data were derived

Hypothesis Testing and Predictive Modeling Part 3/3

  • Hypothesis testing: usually makes assumptions about the distribution of the data, and hypotheses are formed relative to a parameter.
  • Predictive modelling: makes assumptions about the form of the relationship and the distribution of the errors.

Hypothesis testing in R REVIEW Part 1/3

  • State the hypothesis (pair), e.g. \(H_o: \mu_1 = \mu_2\) vs \(H_a: \mu_1 < \mu_2\).
  • Test statistic depends on assumption about the distribution, e.g. 
    • \(t\)-test will assume that distributions are normal, or tolerate small departures from normality if we have a large sample,
    • a two-sample test might assume both groups have the same variance
  • Steps to complete:
    • Compute the test statistic
    • Measure it against a standard distribution
    • If it is extreme, \(p\)-value is small, decision is to reject \(H_o\)
    • \(p\)-value is the probability of observing a value as large as this, or larger, assuming \(H_o\) is true.
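These steps can be carried out by hand on the built-in sleep data and checked against t.test(); a sketch of the paired, two-sided case:

```r
# Paired differences, group 1 minus group 2, from the built-in sleep data
d <- sleep$extra[sleep$group == 1] - sleep$extra[sleep$group == 2]

# Test statistic: mean difference over its standard error
tstat <- mean(d) / (sd(d) / sqrt(length(d)))

# Two-sided p-value against a t distribution with n - 1 df
pval <- 2 * pt(-abs(tstat), df = length(d) - 1)

# Agrees with the built-in paired t-test
tt <- t.test(sleep$extra[sleep$group == 1],
             sleep$extra[sleep$group == 2], paired = TRUE)
```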

Example: Checking variance and distribution assumption Part 1/2

data(sleep)
ggplot(sleep, aes(x=group, y=extra)) + 
  geom_boxplot() +
  geom_point(colour="orange")

ggplot(sleep, aes(x=extra)) + 
  geom_density(fill="orange", colour="orange", alpha=0.6) + 
  geom_rug(outside = TRUE, colour="orange") +
  coord_cartesian(clip = "off") +
  facet_wrap(~group)

Cushny, A. R. and Peebles, A. R. (1905) The action of optical isomers: II hyoscines. The Journal of Physiology 32, 501–510.

Example: Hypothesis test Part 2/2

tt <- with(sleep,
     t.test(extra[group == 1],
            extra[group == 2], 
            paired = TRUE))
tt$estimate
mean difference 
           -1.6 
tt$null.value
mean difference 
              0 
tt$statistic
   t 
-4.1 
tt$p.value
[1] 0.0028
tt$conf.int
[1] -2.5 -0.7
attr(,"conf.level")
[1] 0.95

Cushny, A. R. and Peebles, A. R. (1905) The action of optical isomers: II hyoscines. The Journal of Physiology 32, 501–510.

Example: Checking distribution assumption

InsectSprays %>% 
  ggplot(aes(x=fct_reorder(spray, count), 
             y=count)) + 
  geom_jitter(width=0.1, height=0, colour="orange", size=3, alpha=0.8) +
  xlab("") 

Can you see any violations of normality? Or equal variance?

fm1 <- aov(count ~ spray, data = InsectSprays)
summary(fm1)
            Df Sum Sq Mean Sq F value Pr(>F)    
spray        5   2669     534    34.7 <2e-16 ***
Residuals   66   1015      15                   
---
Signif. codes:  
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Write down the hypothesis being tested. What would the decision be?

Linear models in R REVIEW Part 1/3

library(tidyverse)
library(broom)
glimpse(cars)
Rows: 50
Columns: 2
$ speed <dbl> 4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12…
$ dist  <dbl> 2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14…
ggplot(cars, aes(speed, dist)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE)

Linear models in R REVIEW Part 2/3

  • We can fit linear models in R with the lm function:
lm(dist ~ speed, data = cars)

is the same as

lm(dist ~ 1 + speed, data = cars)
  • The above model is mathematically written as \[y_i = \beta_0 + \beta_1 x_i + e_i\] where
    • \(y_i\) and \(x_i\) are the stopping distance (in ft) and speed (in mph), respectively, of the \(i\)-th car;
    • \(\beta_0\) and \(\beta_1\) are the intercept and slope, respectively; and
    • \(e_i\) is the random error; usually assuming \(e_i \sim NID(0, \sigma^2)\).
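A quick check that the two lm() calls above fit the identical model (the intercept is included by default):

```r
fit_a <- lm(dist ~ speed, data = cars)
fit_b <- lm(dist ~ 1 + speed, data = cars)

# Identical estimates for intercept and slope
all.equal(coef(fit_a), coef(fit_b))
```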

Linear models in R REVIEW Part 3/3

ggplot(cars, aes(speed, dist)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE)

fit <- lm(dist ~ 1 + speed, data = cars)
tidy(fit)
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   -17.6      6.76      -2.60 1.23e- 2
2 speed           3.93     0.416      9.46 1.49e-12
glance(fit)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic  p.value    df
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>
1     0.651         0.644  15.4      89.6 1.49e-12     1
# ℹ 6 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>,
#   deviance <dbl>, df.residual <int>, nobs <int>
  • Assuming this model is appropriate, the stopping distance increases by about 4 ft for each 1 mph increase in speed.

Model form Part 1/2

  • Say, we are interested in characterising the price of the diamond in terms of its carat.
ggplot(diamonds, aes(carat, price)) + 
  geom_point(alpha = 1/5) + 
  ggtitle("Diamonds") #+ 

  #geom_smooth(method = "lm", se = FALSE) + 
  #ylim(0, max(diamonds$price))
  • Looking at this plot, would you fit a linear model with formula
price ~ 1 + carat?

Model form Part 1/2

  • Say, we are interested in characterising the price of the diamond in terms of its carat.
ggplot(diamonds, aes(carat, price)) + 
  geom_point(alpha = 1/5) + 
  ggtitle("Diamonds") + 
  geom_smooth(method = "lm", se = FALSE, size = 2) + 
  ylim(0, max(diamonds$price))

  • Looking at this plot, would you fit a linear model with formula
price ~ 1 + carat?

Model form Part 2/2

ggplot(diamonds, aes(carat, price)) + 
  geom_point(alpha = 1/5) + 
  ggtitle("Diamonds") + 
  geom_smooth(method = lm) +
  geom_smooth(method = lm, 
              formula = y ~ poly(x, 2),
              colour = "orange") +
  ylim(0, max(diamonds$price))

  • What about
    price ~ poly(carat, 2)?
    which is the same as fitting:

\[y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + e_i.\]

  • Should the assumption for error distribution be modified if so?
  • Should we make some transformation before modelling?
  • Are there other candidate models?

Model form Part 2/2

  • Notice that there was no formal statistical inference when trying to determine an appropriate model form.
  • The goal of the main analysis is to characterise the price of a diamond by its carat. This may involve:
    • formal inference for model selection;
    • justification of the selected “final” model; and
    • fitting the final model.
  • There may be in fact many, many models considered but discarded at the IDA stage.
  • These discarded models are hardly ever reported. Consequently, the majority of reported statistics give a distorted view, and it’s important to remind yourself what might not be reported.

Model selection

All models are approximate and tentative; approximate in the sense that no model is exactly true and tentative in that they may be modified in the light of further data

—Chatfield (1985)

All models are wrong but some are useful

—George Box

Model diagnostics

Residuals 1/2

library(broom)
d_fit1 <- lm(price ~ carat, data=diamonds)
d_fit2 <- lm(price ~ poly(carat, 2), data=diamonds)
d_fit3 <- lm(price ~ poly(carat, 3), data=diamonds)
d_fit4 <- lm(price ~ poly(carat, 4), data=diamonds)

d_res1 <- augment(d_fit1, diamonds)
d_res2 <- augment(d_fit2, diamonds)
d_res3 <- augment(d_fit3, diamonds)
d_res4 <- augment(d_fit4, diamonds)
 
r1 <- ggplot(d_res1, aes(carat, .resid)) + 
  geom_hline(yintercept=0, colour="grey70") +
  geom_point(alpha = 1/5) +
  geom_smooth(data=d_res1[sample(1:nrow(d_res1), 5000),], method = "loess", colour="orange", se=F) +
  scale_y_continuous(".resid", breaks=seq(-20000, 10000, 10000), labels=seq(-20, 10, 10)) +
  ggtitle("Linear")  
r2 <- ggplot(d_res2, aes(carat, .resid)) + 
  geom_hline(yintercept=0, colour="grey70") +
  geom_point(alpha = 1/5) +
  geom_smooth(data=d_res2[sample(1:nrow(d_res2), 5000),], method = "loess", colour="orange", se=F) +
  scale_y_continuous(".resid", breaks=seq(-20000, 10000, 10000), labels=seq(-20, 10, 10)) +
  ggtitle("Quadratic") 
r3 <- ggplot(d_res3, aes(carat, .resid)) + 
  geom_hline(yintercept=0, colour="grey70") +
  geom_point(alpha = 1/5) +
  geom_smooth(data=d_res3[sample(1:nrow(d_res3), 5000),], method = "loess", colour="orange", se=F) +
  scale_y_continuous(".resid", breaks=seq(-10000, 30000, 10000), labels=seq(-10, 30, 10)) +
  ggtitle("Cubic") 
r4 <- ggplot(d_res4, aes(carat, .resid)) + 
  geom_hline(yintercept=0, colour="grey70") +
  geom_point(alpha = 1/5) +
  geom_smooth(data=d_res4[sample(1:nrow(d_res4), 5000),], method = "loess", colour="orange", se=F) +
  scale_y_continuous(".resid", breaks=seq(-10000, 10000, 10000), labels=seq(-10, 10, 10)) +
  ggtitle("Quartic") 

r1 + r2 + r3 + r4 + plot_layout(ncol=2)


Residual = Observed - Fitted



Residual plot: plot the residuals against an explanatory variable (or the fitted values)

The best residual plot has no obvious pattern.
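The definition above can be verified directly; a base-R sketch on the built-in cars data (used here instead of diamonds to keep it self-contained):

```r
fit <- lm(dist ~ speed, data = cars)

# Residual = Observed - Fitted, exactly
all.equal(unname(resid(fit)), cars$dist - unname(fitted(fit)))
```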

Alternative approach: linearise the relationship

ggplot(diamonds, aes(carat, price)) + 
  geom_point(alpha = 1/5) + 
  ggtitle("Diamonds") + 
  geom_smooth(method = lm) +
  scale_x_sqrt() +
  scale_y_sqrt() + ggtitle("Transform both x, y by sq root")

ggplot(diamonds, aes(carat, price)) + 
  geom_point(alpha = 1/5) + 
  ggtitle("Diamonds") + 
  geom_smooth(method = lm) +
  scale_x_log10() +
  scale_y_log10() + ggtitle("Transform both x, y by log10")

The log transformation of both variables linearises the relationship, so that a simple linear model can be used, and also corrects the heteroskedasticity.
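The reason the log-log transform works: if price follows an approximate power law, price ≈ a·carat^b, then log10(price) = log10(a) + b·log10(carat), a straight line. A sketch on simulated power-law data (a = 2 and b = 1.7 are made-up values, not estimates from diamonds):

```r
x <- seq(0.2, 3, by = 0.1)
y <- 2 * x^1.7            # exact power law, no noise

fit <- lm(log10(y) ~ log10(x))
coef(fit)                 # intercept = log10(2), slope = 1.7 (recovered)
```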

Residuals 2/2

library(broom)
diamonds <- diamonds %>%
  mutate(sqprice = sqrt(price),
         sqcarat = sqrt(carat),
         lprice = log10(price),
         lcarat = log10(carat))
d_fit5 <- lm(sqprice ~ sqcarat, data=diamonds)
d_fit6 <- lm(lprice ~ lcarat, data=diamonds)

d_res5 <- augment(d_fit5, diamonds)
d_res6 <- augment(d_fit6, diamonds)

r5 <- ggplot(d_res5, aes(carat, .resid)) + 
  geom_hline(yintercept=0, colour="grey70") +
  geom_point(alpha = 1/5) +
  geom_smooth(data=d_res5[sample(1:nrow(d_res5), 5000),], method = "loess", colour="orange", se=F) +
  ggtitle("Square root")  
r6 <- ggplot(d_res6, aes(carat, .resid)) + 
  geom_hline(yintercept=0, colour="grey70") +
  geom_point(alpha = 1/5) +
  geom_smooth(data=d_res6[sample(1:nrow(d_res6), 5000),], method = "loess", colour="orange", se=F) +
  ggtitle("Log10") 
r7 <- ggplot(d_res5, aes(carat, .resid)) + 
  geom_hline(yintercept=0, colour="grey70") +
  geom_point(alpha = 1/5) +
  geom_smooth(data=d_res5[sample(1:nrow(d_res5), 5000),], method = "loess", colour="orange", se=F) +
  xlim(c(0, 3)) +
  ggtitle("Square root")  
r8 <- ggplot(d_res6, aes(carat, .resid)) + 
  geom_hline(yintercept=0, colour="grey70") +
  geom_point(alpha = 1/5) +
  geom_smooth(data=d_res6[sample(1:nrow(d_res6), 5000),], method = "loess", colour="orange", se=F) +
  xlim(c(0, 3)) +
  ggtitle("Log10") 

r5 + r6 + r7 + r8 + plot_layout(ncol=2)


Which has the best residual plot?

“Teaching of Statistics should provide a more balanced blend of IDA and inference” Chatfield (1985)

Yet there is still very little emphasis on it in teaching, and at times also in practice.


So don’t forget to do IDA!

Take away messages

  • Initial data analysis (IDA) is a model-focused exploration to support a confirmatory analysis with:
    • data description and collection
    • data quality checking, and
    • checking assumptions
    • model fit without any formal statistical inference.
  • IDA may never see the limelight BUT it forms the foundation that the main analysis is built upon. Do it well!

Further reading